BADASS: BActeriocin-Diversity ASsessment Software

Background: Bacteriocins are defined as thermolabile peptides produced by bacteria with biological activity against taxonomically related species. These antimicrobial peptides have a wide range of applications, including disease treatment, food conservation, and probiotics. However, despite their large industrial and biotechnological potential, these peptides are still poorly studied and explored. BADASS is software with a user-friendly graphical interface for searching and analyzing bacteriocin diversity in whole-metagenome shotgun sequencing (WMS) data.
Results: The search for bacteriocin sequences is performed with tools such as BLAST or DIAMOND, using the BAGEL4 database as a reference. The putative bacteriocin sequences identified are used to determine the abundance and richness of the three classes of bacteriocins. Abundance is calculated by comparing the reads identified as bacteriocins to the reads identified as the 16S rRNA gene, using the SILVA database as a reference. BADASS has a complete pipeline that starts with the quality assessment of the raw data. At the end of the analysis, BADASS automatically generates several plots of richness and abundance, as well as tabular files containing information about the main bacteriocins detected. The user can change the main parameters of the analysis in the graphical interface. To demonstrate how the software works, we analyzed four datasets from WMS studies with default parameters. Lantibiotics were the most abundant bacteriocins in the four datasets. This class of bacteriocin is commonly produced by Streptomyces sp.
Conclusions: With a user-friendly graphical interface and a complete pipeline, BADASS proved to be a powerful tool for prospecting bacteriocin sequences in WMS data. This tool is publicly available at https://sourceforge.net/projects/badass/.


Background
Characterization of bioactive molecules produced by free-living microorganisms has been very important in recent years because of their biotechnological applications. It is well known that the overwhelming majority of free-living microorganisms cannot be grown under laboratory conditions [1], which is a bottleneck for the identification and isolation of bioactive compounds. An alternative way to search for new compounds is the well-established method of whole-metagenome shotgun sequencing (WMS), in which the nucleic acids of the microbial community are extracted and sequenced directly from environmental samples [2]. In this way, genes involved in the synthesis of peptide or non-peptide bioactive compounds can be assessed. The remaining bottleneck lies in the development of user-friendly tools that allow the user to analyze a large amount of data in a simple and interactive way.
Class I consists of peptides that undergo structural changes after translation; this class is also called lantibiotics. They have a molecular weight below 5 kDa and a size smaller than 28 amino acids [12,13]. Class II is characterized by peptides that do not undergo post-translational modifications. They are larger than class I bacteriocins and have a molecular weight below 10 kDa [14]. Class III is composed of peptides with a molecular weight higher than 30 kDa. Bacteriocins of this class have a mechanism of action different from that of the other two classes: they eliminate bacterial cells through cell wall hydrolysis [12,15,16].
A variety of software has been developed to search WMS data for antibiotic resistance genes (ARGs) or for secondary metabolite genes such as nonribosomal peptide synthetase (NRPS) or polyketide synthase (PKS) [6,17]. However, none of this software focuses on prospecting for bacteriocins. Anti-SMASH [18] is an excellent tool for analyzing genomes, while RiPPER [19] works better for pan-genome data. The BAGEL web tool (BActeriocin GEnome mining tooL) [17] is one of the first tools developed for the identification of peptides and bacteriocins in genome data. However, the tool imposes a maximum size on the input file, which makes the analysis of WMS data difficult.
In this article, we present BADASS (BActeriocin-Diversity ASsessment Software), an automated pipeline with an intuitive graphical interface that allows users to analyze the diversity of bacteriocins using WMS raw data. Diversity measurement is based on the abundance and richness of the three bacteriocin classes currently described. The software is available at https://sourceforge.net/projects/badass/.

Pipeline
The pipeline of BADASS (Fig. 1) starts with the automatic loading of bacteriocin sequences into the database. This process needs to be executed only on the first use of the software. The input file consists of a WMS sequencing sample in FASTQ format. After the project is saved, the following steps are performed.
(a) Quality assessment. The user can choose to evaluate the raw data with a boxplot chart that relates the Phred score of a base (y axis) to base position (x axis). This is an optional step that helps users decide on the quality filter values that will be used in the next step. A boxplot with the result of the FastQC analysis is produced at this stage and displayed to the user.
(b) Trimming and quality filter. Raw data is trimmed to remove bases at the end of the reads with a Phred score below the cut-off value provided by the user. Subsequently, reads are filtered according to parameters such as minimum length and quality score. The Fastx Toolkit software is used in this step.
(c) Parser fastq2fasta. The trimmed and filtered file is converted into FASTA format.
(d) Mapping against the bacteriocin database. To identify bacteriocins, we adapted a search method used in several works [20,21]. First, a database is built using non-redundant BAGEL4 sequences [22]. Subsequently, the BLAST+ tool [22,23] is used to compare the translated reads against the BAGEL4 database with blastx. The best hit for each read identified as bacteriocin is retrieved. The user can define an e-value cut-off for the homology analysis.
(e) Mapping against the 16S rRNA SILVA database. The same file used in the previous step is aligned against the SILVA database using DIAMOND [24]. Two files in .csv format are generated. The first contains the list of subject nucleotide sequences with their respective identity values. The other contains the best hit for each query sequence based on an identity cut-off value provided by the user. Cut-off values are adjustable in the graphical interface of BADASS.
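As an illustration of step (b), the following minimal Python sketch mimics 3'-end trimming followed by a percent-of-bases quality filter. It is not BADASS code: the function names are hypothetical, Phred+33 encoding is assumed, and the default values simply mirror the validation parameters reported in this article (quality threshold 20, minimum length 100, minimum quality score 16, minimum percent 80).

```python
def phred_scores(quality_line, offset=33):
    """Decode a FASTQ quality string into integer Phred scores.

    Phred+33 encoding is an assumption of this sketch.
    """
    return [ord(ch) - offset for ch in quality_line]


def trim_read(seq, qual, threshold=20, min_length=100):
    """Trim bases below `threshold` from the 3' end of a read.

    Returns (sequence, quality) after trimming, or None when the
    trimmed read is shorter than `min_length`.
    """
    scores = phred_scores(qual)
    end = len(scores)
    # Walk back from the 3' end while the base quality is too low.
    while end > 0 and scores[end - 1] < threshold:
        end -= 1
    if end < min_length:
        return None
    return seq[:end], qual[:end]


def passes_filter(qual, min_quality=16, min_percent=80):
    """Keep a read only if at least `min_percent` of its bases
    have a Phred score of `min_quality` or higher."""
    scores = phred_scores(qual)
    if not scores:
        return False
    good = sum(1 for s in scores if s >= min_quality)
    return 100.0 * good / len(scores) >= min_percent
```

A read would be kept only if `trim_read` returns a result and `passes_filter` accepts the trimmed quality string; stricter thresholds discard more reads and therefore reduce the number of candidates passed to the mapping steps.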
The numbers of reads identified as bacteriocins and as 16S rRNA are used to calculate the richness and abundance of bacteriocins in the WMS dataset, as described in the Abundance analysis section.
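The best-hit retrieval and read counting described above can be sketched as follows. This is a hedged example, not the BADASS implementation: it assumes the alignment results are available in the standard 12-column BLAST tabular layout (query, subject, identity, ..., e-value, bit score), and the function names and the default e-value cut-off are hypothetical.

```python
from collections import Counter


def best_hits(tabular_lines, max_evalue=1e-5):
    """Return {read_id: (subject_id, evalue, bitscore)}.

    For each read, only the hit with the highest bit score is kept,
    and hits with an e-value above `max_evalue` are ignored.
    Assumes the 12-column BLAST tabular layout.
    """
    best = {}
    for line in tabular_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        read_id, subject_id = fields[0], fields[1]
        evalue, bitscore = float(fields[10]), float(fields[11])
        if evalue > max_evalue:
            continue
        if read_id not in best or bitscore > best[read_id][2]:
            best[read_id] = (subject_id, evalue, bitscore)
    return best


def reads_per_subject(hits):
    """Count how many reads were assigned to each subject sequence."""
    return Counter(subject for subject, _, _ in hits.values())
```

Under these assumptions, the number of keys returned by `reads_per_subject` corresponds to the richness (distinct bacteriocins detected), and the counts are the per-bacteriocin read numbers that enter the abundance calculation.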

Programming language and database
BADASS was developed in Java (https://www.oracle.com), using the Maven tool (https://maven.apache.org/) to build and manage the project. Maven was chosen because it automates dependency management and the generation of the JAR package containing the software dependencies. The Swing library was used to produce the graphical interface. The database management system used to control the steps and manage the project was SQLite v.3 (https://www.sqlite.org/).

Data source and software validation
The software was validated using four whole-metagenome shotgun sequencing datasets. Three samples were obtained from the Tucuruí Hydroelectric Power Plant water reservoir and are deposited in the EBI database under the accession numbers ERS1560860, ERS1560861, and ERS1562591 [8]. The fourth sample was obtained from Unai's Hot Spring and is deposited in the ENA (European Nucleotide Archive) database under the accession number PRJEB8864. The following parameters were used: quality threshold: 20; minimum length: 100; minimum quality score to keep: 16; minimum percent: 80%; e-value: 10; threads: 6; identity: 50.

Abundance analysis
Diversity of bacteriocins was analyzed in terms of abundance and richness. To calculate the abundance of bacteriocins, we adapted the formula proposed in studies of the abundance of resistance genes in WMS data [20]. The terms of the formula are: (1) $n$, the number of bacteriocins found in the sample; (2) $N_{\text{bacteriocin sequences}}$, the number of reads that mapped to a specific bacteriocin; (3) $T_{\text{read}}$, the average size of the reads; (4) $T_{\text{bacteriocin}}$, the average size of the bacteriocin; (5) $N_{\text{16S rRNA sequences}}$, the number of reads that mapped to 16S rRNA sequences; and (6) $T_{\text{16S rRNA sequence}}$, the average size of the 16S rRNA sequence.
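The following Python sketch shows one way the six terms can be combined. It assumes that the adapted formula keeps the form of the ARG abundance calculation in [20], that is, the sum over the n bacteriocins of (N bacteriocin × T read / T bacteriocin) divided by (N 16S rRNA × T read / T 16S rRNA). This form is an assumption of the sketch, as are the function and parameter names and the requirement that all lengths be expressed in the same unit.

```python
def bacteriocin_abundance(bacteriocins, read_length, n_16s_reads, length_16s):
    """Abundance of bacteriocins normalized by 16S rRNA read coverage.

    bacteriocins : iterable of (mapped_reads, bacteriocin_length) pairs,
                   one pair per bacteriocin found in the sample
    read_length  : average read size (T_read)
    n_16s_reads  : number of reads mapped to 16S rRNA sequences
    length_16s   : average size of the 16S rRNA sequence

    Assumed form (adapted from the ARG abundance calculation):
        sum_i (N_i * T_read / T_i) / (N_16S * T_read / T_16S)
    """
    if n_16s_reads <= 0:
        raise ValueError("at least one 16S rRNA read is required")
    # Coverage of the 16S rRNA gene, used as the per-cell reference.
    reference = n_16s_reads * read_length / length_16s
    return sum(
        (mapped * read_length / length) / reference
        for mapped, length in bacteriocins
    )
```

For example, `bacteriocin_abundance([(10, 150), (5, 300)], 100, 200, 1500)` evaluates to 0.625. Under the assumed form the read length cancels between numerator and denominator, so the result depends only on read counts and reference lengths.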

Workstation
Analyses were performed on a desktop computer with an Intel® Core™ i7-10510U CPU @ 1.80 GHz with 8 processing cores and 16 GB of RAM. Tests were run on the 64-bit Ubuntu 21.10, Windows 11, and macOS Ventura 13.0 operating systems.

Results and discussion
BADASS was developed using the BLAST and DIAMOND alignment tools to identify bacteriocin and 16S rRNA sequences in WMS raw data. The user can select the tool in the BADASS GUI. Other studies have used similar methodologies for prospecting relevant genes such as ARGs [7,20,29,30]. For example, ARGs-OAP is an online pipeline that detects antibiotic resistance genes in metagenomic data through sequence similarity analysis [5]. In addition to the homology search, BADASS calculates the abundance of each bacteriocin class by taking into account the size of the genes and the size of the reads produced by the sequencing library [22]. The number of reads identified as bacteriocins is compared to the number of reads identified as the 16S rRNA gene, which is present in a few copies per cell. This normalization prevents the size of the reads and the size of the genes from interfering with the analysis of abundance. Thus, this pipeline is a powerful tool for rapid and comprehensive evaluation of bacteriocin diversity using WMS raw data.
The main results of BADASS are the description of richness and the abundance values of bacteriocins, obtained in a simple and intuitive way from WMS raw data as the input file. The software provides a set of adjustable parameters in the graphical interface. Users can also choose to process the samples using default parameters. In more detail, the results obtained by the software include: (1) quality assessment box plots of the raw data, shown directly in the graphical interface and saved in the results folder; (2) spreadsheets in .xlsx or .csv format containing information about the identified bacteriocins (richness), including their frequency (ratio between the number of reads identified as bacteriocin and the total number of reads in a sample) and abundance (calculated using the formula previously mentioned), organized by class; (3) bar plots of abundance based on the .csv files; (4) a .csv file containing the list of 16S rRNA sequences identified in the dataset, including the percentage identity; (5) Trimmed.fastq and QV.fastq files containing the trimmed reads and the reads filtered by minimum size, respectively.
It is also possible to detect bacteriocins in genome sequences using other software. Table 1 presents several computational tools and databases developed to help in the identification of these antimicrobial peptides. The main features of each software or database are compared in the table. It is worth noting that BADASS is the only one that has a graphical interface, supports WMS data, and performs diversity analysis (Table 1). BAGEL4 stands out for having one of the most complete databases, containing a large number of annotated and experimentally verified bacteriocin sequences. In addition, the database is divided into three classes according to genetic information and mechanism of action. Because of these features, BAGEL4 was used as the reference bacteriocin database in BADASS. BACTIBASE [30] is a database containing detailed information about the physicochemical properties of bacteriocins. This information allows a fast and accurate prediction of the structure-function relationship and possible target organisms of the antimicrobial peptides. Other relevant software includes BOA (Bacteriocin Operon Associator) [31], which uses Hidden Markov Models to predict bacteriocin clusters; Neubi [32], which identifies bacteriocins using a word embedding approach; and Anti-SMASH (Antibiotics and Secondary Metabolite Analysis Shell), which was launched in 2011 and is used not only for bacteriocin prediction but also for a number of other secondary metabolites [18,33].
The identification of bacteriocins, however, is still quite challenging due to the limited number of known and experimentally analyzed sequences. Choosing the most appropriate and up-to-date tool is essential for the search and identification of bacteriocin genes. BADASS is user-friendly software with a robust pipeline that starts with the quality assessment of the raw data and ends with the analysis of the richness and abundance of bacteriocins.
A pilot analysis was performed using four datasets with default parameters. The results for the dataset ERR1816708 are presented in Table 2. The first column of the table shows the BAGEL4 accession number and name of each bacteriocin. The other columns correspond to frequency, abundance, and class, respectively. Thus, users are able to identify the diversity of bacteriocins in the dataset. Additionally, complementary analyses such as the taxonomic affiliation of the microbial community are important to determine the ecological context of the described bacteriocins [8].
Two bar plots containing an overview of the bacteriocin diversity were generated by the software. The first plot (Fig. 2) is based on a .csv file similar to Table 2 and presents the ten most abundant bacteriocins in the dataset. Information about the bacterial species that commonly produce the peptides is presented in the legend. The second plot (Fig. 3) presents the abundance of bacteriocins by class. The best parameters for each study should be carefully chosen by the user according to the characteristics of the dataset. Variation in the results is expected, since stricter or looser parameters change which reads are retained and which hits are accepted [34]. The time required for similarity search and post-alignment analysis has become a bottleneck as sequencing costs decrease and the size of the datasets increases [35]. We also highlight that the entire analysis, starting from the filtering of the raw data, can be done in the BADASS pipeline. The software allows users to modify most of the parameters, such as the e-value and the identity cut-off.
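The two summaries behind the bar plots (ten most abundant bacteriocins and abundance per class) can be derived from a per-bacteriocin table with a few lines of Python. This is only a sketch: the row layout, with hypothetical keys "name", "abundance", and "class", is an assumption modeled on the columns described for Table 2, and BADASS itself produces its plots through R.

```python
from collections import defaultdict


def top_abundant(rows, k=10):
    """Return the k rows with the highest abundance, in descending order."""
    return sorted(rows, key=lambda row: row["abundance"], reverse=True)[:k]


def abundance_by_class(rows):
    """Sum the abundance of all bacteriocins belonging to each class."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["class"]] += row["abundance"]
    return dict(totals)
```

Each row is a dictionary such as `{"name": "bacA", "abundance": 0.5, "class": "I"}`; the values in this example are illustrative only and do not come from the datasets analyzed here.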

Conclusions
In the environment, countless microbial species coexist and, in order to succeed in colonizing their ecological niches, many have developed mechanisms to eliminate other species through the production of antimicrobial molecules. In this chemical warfare, bacteriocins are narrow-spectrum antimicrobial peptides synthesized by ribosomal activity that are widely distributed in bacterial species. Thus, the development of computational tools to identify, classify, and quantify bacteriocins in WMS datasets is of great importance for microbial ecology and biotechnology.
BADASS provides the user with a robust and automated computational tool with a simple and intuitive graphical interface, where the parameters can be adjusted by the user, allowing greater independence in the analysis of different samples. The integration of the software with the R statistical platform allows the generation of plots that help in data interpretation. For those looking to prospect antimicrobial peptides in WMS raw data, BADASS is a powerful solution.

Availability and requirements
Project name: BADASS
Project home page: https://sourceforge.net/projects/badass/
Operating system(s): platform independent
Programming language: Java
Other requirements: Java 19.0.1 or higher
License: GNU GPL
Any restrictions to use by non-academics: license needed